A Scalable, Distributed Web-Crawler*

نویسندگان

  • Ankit Jain
  • Abhishek Singh
  • Ling Liu
چکیده

In this paper we present a design and implementation of a scalable, distributed web-crawler. The motivation for design of such a system to effectively distribute crawling tasks to different machined in a peer-peer distributed network. Such architecture will lead to scalability and help tame the exponential growth or crawl space in the World Wide Web. With experiments on the implementation of the proto-type of the system we derive show the scalability and efficiency of such a system.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Trovatore: Towards a Highly Scalable Distributed Web Crawler

Trovatore is an ongoing project aimed at realizing an efficient distributed and highly scalable web crawler. This poster illustrates the main ideas behind its design.

متن کامل

UbiCrawler: a scalable fully distributed Web crawler

We present the design and implementation of UbiCrawler, a scalable distributed web crawler, and we analyze its performance. The main features of UbiCrawler are platform independence, fault tolerance, a very effective assignment function for partitioning the domain to crawl, and more in general the complete decentralization of every task.

متن کامل

Design and Implementation of an Efficient Distributed Web Crawler with Scalable Architecture

Distributed Web crawlers have recently received more and more attention from researchers. Centralized solutions are known to have problems like link congestion, being a single point of failure ,while the fully distributed crawlers become an interesting architectural paradigm for its scalability, increased autonomy of nodes. This paper provides a distributed crawler system which consists of mult...

متن کامل

Design, implementation and experiment of a YeSQL Web Crawler

We describe a novel, “focusable”, scalable, distributed web crawler based on GNU/Linux and PostgreSQL that we designed to be easily extendible and which we have released under a GNU public licence. We also report a first use case related to an analysis of Twitter’s streams about the french 2012 presidential elections and the URL’s it contains.

متن کامل

GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources

With the advance of the World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. Thus, the “big geospatial data management” issues start attracting attention. Among the big geospatial data issues, this research focuses on discovering distributed geospatial resources. As resources are scattered on the WWW, users cannot find resource...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2000